Pre-trained language models, despite their rapid advancements powered by scale, still fall short of robust commonsense capabilities. And yet, scale appears to be the winning recipe; after all, the largest models seem to have acquired the largest amount of commonsense capabilities. Or is it? In this paper, we investigate the possibility of a seemingly impossible match: can smaller language models with dismal commonsense capabilities (i.e., GPT-2), ever win over models that are orders of magnitude larger and better (i.e., GPT-3), if the smaller models are powered with novel commonsense distillation algorithms? The key intellectual question we ask here is whether it is possible, if at all, to design a learning algorithm that does not benefit from scale, yet leads to a competitive level of commonsense acquisition. In this work, we study the generative models of commonsense knowledge, focusing on the task of generating generics, statements of commonsense facts about everyday concepts, e.g., birds can fly. We introduce a novel commonsense distillation framework, I2D2, that loosely follows the Symbolic Knowledge Distillation of West et al. but breaks the dependence on the extreme-scale models as the teacher model by two innovations: (1) the novel adaptation of NeuroLogic Decoding to enhance the generation quality of the weak, off-the-shelf language models, and (2) self-imitation learning to iteratively learn from the model's own enhanced commonsense acquisition capabilities. Empirical results suggest that scale is not the only way, as novel algorithms can be a promising alternative. Moreover, our study leads to a new corpus of generics, Gen-A-Tomic, that is of the largest and highest quality available to date.
translated by 谷歌翻译
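We introduce Realtime QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). Realtime QA asks about the current world, and QA systems need to answer questions about novel events or information. It therefore challenges the static, conventional assumptions in QA datasets and pursues instantaneous applications. We build strong baseline models on large language models, including GPT-3 and T5. Our benchmark is an ongoing effort, and this preliminary report presents real-time evaluation results over the past month. Our experimental results show that GPT-3 can often properly update its generation results based on newly retrieved documents, highlighting the importance of up-to-date information retrieval. Nonetheless, we find that GPT-3 tends to return outdated answers when the retrieved documents do not provide sufficient information to find an answer. This suggests an important avenue for future research: can an open-domain QA system identify such unanswerable cases and communicate with the user, or even the retrieval module, to modify the retrieval results? We hope that Realtime QA will spur progress in instantaneous applications of question answering and beyond.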
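How can an end user provide feedback if a deployed structured prediction model generates inconsistent output, ignoring the structural complexity of human language? This is an emerging topic with recent progress in synthetic or constrained settings, and the next big leap requires testing and tuning models in real-world settings. We present a new dataset, Interscript, containing user feedback on a deployed model that generates complex everyday tasks. Interscript contains 8,466 data points: the input is a possibly erroneous script together with user feedback, and the output is a modified script. We posit two use cases of Interscript that might significantly advance the state of the art in interactive learning. The dataset is available at: https://github.com/allenai/interscript.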
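Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to focus on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously tracks progress in language generation tasks and in the metrics used to evaluate them. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries. A Billboard automatically creates an ensemble metric that selects and linearly combines a few metrics based on a global analysis across generators. Further, metrics are ranked based on their correlation with human judgments. We release four Billboards for machine translation, summarization, and image captioning. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation. Our mixed-effects model analysis shows that most automatic metrics, especially the reference-based ones, overrate machine over human generation, demonstrating the importance of updating metrics as generation models become stronger (and perhaps more similar to humans) in the future.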
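The two metric-side operations mentioned in this abstract, ranking metrics by their correlation with human judgments and linearly combining a few of them, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the scores are made-up toy data, the helper names (`pearson`, `rank_metrics`, `equal_weight_ensemble`) are hypothetical, Pearson correlation stands in for whatever correlation measure the authors use, and the equal-weight combination over standardized scores is a simplification rather than the selection procedure of the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_metrics(metric_scores, human_scores):
    """Return (metric name, correlation) pairs, best-correlated first."""
    correlations = {
        name: pearson(scores, human_scores)
        for name, scores in metric_scores.items()
    }
    return sorted(correlations.items(), key=lambda item: item[1], reverse=True)


def zscores(xs):
    """Standardize scores so that metrics on different scales are comparable."""
    n = len(xs)
    mean = sum(xs) / n
    std = sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / std for x in xs]


def equal_weight_ensemble(metric_scores, names):
    """Linear combination with equal weights (an assumed simplification)."""
    standardized = [zscores(metric_scores[name]) for name in names]
    return [sum(column) / len(names) for column in zip(*standardized)]


# Toy data, invented for this sketch: one score per generated output.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
metrics = {
    "metric_a": [0.1, 0.2, 0.3, 0.4, 0.5],
    "metric_b": [0.5, 0.1, 0.4, 0.2, 0.3],
    "metric_c": [0.2, 0.1, 0.4, 0.3, 0.5],
}

ranking = rank_metrics(metrics, human)
top_two = [name for name, _ in ranking[:2]]
ensemble = equal_weight_ensemble(metrics, top_two)

for name, corr in ranking:
    print(f"{name}: r = {corr:.2f}")
print("ensemble r =", round(pearson(ensemble, human), 2))
```

A learned ensemble would fit the combination weights against human judgments instead of fixing them to be equal; the sketch keeps them fixed so that it stays dependency-free.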
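We establish a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall), as well as other aspects that measure text quality (fluency, conciseness, and inclusive language). Our evaluations reveal several critical problems with current evaluation practice. Human-generated captions show substantially higher quality than machine-generated ones, especially in their coverage of salient information (i.e., recall), while all automatic metrics say the opposite. Our rubric-based results show that CLIPScore, a recent metric that uses image features, correlates better with human judgments because it is more sensitive to recall. We hope that this work will promote a more transparent evaluation protocol for image captioning and its automatic metrics.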
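As AI systems become increasingly powerful and pervasive, there are growing concerns about machines' morality or lack thereof. Yet teaching morality to machines is a formidable task, as morality remains among the most intensely debated questions in humanity, let alone for AI. Existing AI systems deployed to millions of users, however, are already making decisions loaded with moral implications, which poses a seemingly impossible challenge: teaching machines moral sense while humanity continues to grapple with it. To explore this challenge, we introduce Delphi, an experimental framework based on deep neural networks trained directly on descriptive ethical judgments, e.g., "helping a friend" is generally good, while "helping a friend spread fake news" is not. Empirical results provide new insights into the promises and limits of machine ethics. Delphi demonstrates strong generalization capabilities in the face of novel ethical situations, while off-the-shelf neural network models exhibit markedly poor judgment, including unjust biases, confirming the need to explicitly teach machines moral sense. Yet Delphi is not perfect, exhibiting susceptibility to pervasive biases and inconsistencies. Despite that, we demonstrate positive use cases of an imperfect Delphi, including using it as a component model within other imperfect AI systems. Importantly, we interpret the operationalization of Delphi in light of prominent ethical theories, which leads us to important future research questions.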
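Recent years have brought a renewed interest in commonsense representation and reasoning in the field of natural language understanding. The development of new commonsense knowledge graphs (CSKGs) has been central to these advances, as their diverse facts can be used by machine learning models to tackle new and challenging tasks. At the same time, questions remain about the quality and coverage of these resources because of the massive scale required to comprehensively encompass general commonsense knowledge. In this work, we posit that manually constructed CSKGs will never achieve the coverage necessary to be applicable in all situations encountered by NLP agents. Therefore, we propose a new evaluation framework for testing the utility of KGs based on how effectively implicit knowledge representations can be learned from them. With this new goal, we propose ATOMIC 2020, a new CSKG containing knowledge that is not readily available in pretrained language models. We evaluate its properties in comparison with other leading CSKGs, performing the first large-scale pairwise study of commonsense knowledge resources. Next, we show that ATOMIC 2020 is better suited for training knowledge models that can generate accurate, representative knowledge for new, unseen entities and events. Finally, through human evaluation, we show that the few-shot performance of GPT-3 (175B parameters), while impressive, remains lower than that of a BART-based knowledge model trained on ATOMIC 2020, despite using over 430 times more parameters.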
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question of whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. To investigate this question, we introduce WINOGRANDE, a large-scale dataset of 44k problems, inspired by the original WSC design but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AFLITE algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WINOGRANDE achieve 59.4-79.1%, which is ∼15-35% (absolute) below human performance of 94.0%, depending on the amount of training data allowed (2%-100%, respectively). Furthermore, we establish new state-of-the-art results on five related benchmarks: WSC (→ 90.1%), DPR (→ 93.1%), COPA (→ 90.6%), KnowRef (→ 85.6%), and Winogender (→ 97.1%). These results have dual implications: on the one hand, they demonstrate the effectiveness of WINOGRANDE when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
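The absolute gap to human performance quoted in this abstract follows directly from the reported accuracies. A minimal sketch of the arithmetic, using only the numbers stated above (the dictionary keys are illustrative labels, not terms from the paper):

```python
# Accuracies (%) as reported in the abstract.
human_accuracy = 94.0
model_accuracy = {
    "2% of training data": 59.4,
    "100% of training data": 79.1,
}

# Absolute gap in percentage points between human and best model accuracy.
gaps = {
    setting: round(human_accuracy - accuracy, 1)
    for setting, accuracy in model_accuracy.items()
}

for setting, gap in gaps.items():
    print(f"{setting}: {gap} points below human performance")
```

The two gaps, 34.6 and 14.9 points, match the approximate 15-35% range given in the abstract.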
We consider task allocation for multi-object transport using a multi-robot system, in which each robot selects one object among multiple objects with different and unknown weights. Existing centralized methods assume that the number of robots and tasks is fixed, which makes them inapplicable to scenarios that differ from the learning environment. Existing distributed methods, meanwhile, limit the minimum number of robots and tasks to a constant value, which makes them applicable to various numbers of robots and tasks. However, they cannot transport an object whose weight exceeds the load capacity of the robots observing it. To handle various numbers of robots and objects with different and unknown weights, we propose a task allocation framework based on multi-agent reinforcement learning. First, we introduce a structured policy model consisting of 1) predesigned dynamic task priorities with global communication and 2) a neural network-based distributed policy model that determines the timing for coordination. The distributed policy builds consensus on the high-priority object under local observations and selects cooperative or independent actions. Then, the policy is optimized by multi-agent reinforcement learning through trial and error. This structured policy of local learning and global communication makes our framework applicable to various numbers of robots and objects with different and unknown weights, as demonstrated by numerical simulations.
Artificial life is a research field studying what processes and properties define life, based on a multidisciplinary approach spanning the physical, natural and computational sciences. Artificial life aims to foster a comprehensive study of life beyond "life as we know it" and towards "life as it could be", with theoretical, synthetic and empirical models of the fundamental properties of living systems. While still a relatively young field, artificial life has flourished as an environment for researchers with different backgrounds, welcoming ideas and contributions from a wide range of subjects. Hybrid Life is an attempt to bring attention to some of the most recent developments within the artificial life community, rooted in more traditional artificial life studies but looking at new challenges emerging from interactions with other fields. In particular, Hybrid Life focuses on three complementary themes: 1) theories of systems and agents, 2) hybrid augmentation, with augmented architectures combining living and artificial systems, and 3) hybrid interactions among artificial and biological systems. After discussing some of the major sources of inspiration for these themes, we will focus on an overview of the works that appeared in Hybrid Life special sessions, hosted by the annual Artificial Life Conference between 2018 and 2022.